DOCUMENT RESUME

ED 123 260                                        TM 005 320

AUTHOR          Morse, David T.; Morse, Linda W.
TITLE           A Model for Assessing the Effects of Departures from
                Reality in Performance Testing.
PUB DATE        [Apr 76]
NOTE            27p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (60th, San
                Francisco, California, April 19-23, 1976)

EDRS PRICE      MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS     Cost Effectiveness; Decision Making; *Mathematical
                Models; Measurement Techniques; *Performance Tests;
                Statistical Analysis; *Test Construction; Testing
                Problems; *Test Reliability; *Test Validity; True
                Scores
IDENTIFIERS     *Generalizability Theory



ABSTRACT

Performance testing often entails the usage of
expensive, time-consuming measures in the quest for determining the
level of performance on some desired behavior. It is concluded that a
generalizability theory approach to dealing with departures from
reality in testing can aid in the establishment of empirically-based
choices of measurement strategies. This paper presents a model for
assessing the loss of information due to using a measure which may be
less realistic, but more feasible, than the desired behavior. The
method is based on the concept of generalizability theory. An example
is included along with a brief discussion of relevant considerations
in performance testing, a background on generalizability theory, and
a discussion on decision-making. (Author/BEP)




A Model for Assessing the Effects of Departures 
from Reality in Performance Testing 

David T. Morse and Linda W. Morse 

Career Education Center 
Florida State University 






A paper presented at the annual meeting of the American Educational 
Research Association, San Francisco, California, April 19-22, 1976. 




ABSTRACT

Performance testing often entails the usage of expensive,
time-consuming measures in the quest for determining the level
of performance on some desired behavior. This paper presents a
model for assessing the loss of information due to using a measure
which may be less realistic, but more feasible, than the desired
behavior. The method is based on the concept of generalizability
theory. An example is included along with a brief discussion of
relevant considerations in performance testing, a background on
generalizability theory, and discussion on decision-making.

A Model for Assessing the Effects of
Departures from Reality in Performance Testing

The ideal measurement strategy in a performance-based learning 
situation is to have the learner attempt the desired behavior by 
demonstrating his competence in a performance setting. For instance, 
if the behavior is to "successfully overhaul and rebuild a V-8 en- 
gine," or to "successfully navigate on land from an unfamiliar point 
to base using only compass and relief map," the most desirable per- 
formance test for the first would be to supply an automobile with 
an engine in need of overhaul and supply the required tools and 
equipment. For the second example, the potential navigator should 
be placed in unfamiliar surroundings with only compass and relief 
map. Certain overriding considerations, however, may prohibit use 
of such direct measures of performance. For both examples, factors
such as lack of time, money, equipment, and supervisory personnel
might dictate that the test actually used be a measure as indirect 
as a short paper-and-pencil test covering selected aspects of en- 
gine rebuilding or land navigation. Clearly, this is not as desirable 
as use of the direct measure of performance. When decision-makers 
are confronted with the necessary use of a less direct measure of 
performance, however, how should they select which one to use, and 
how much loss of fidelity to the actual behavior must they accept? 
These questions should be asked and answered in situations where 
a less direct measure of performance is being utilized. 

The purpose of this paper is to present a method of assessing 
the effects of departures from reality using a generalizability theory
approach. Once the effects of changes in fidelity have been
determined, instructional designers and/or measurement specialists
have a basis for the rational selection of measurement strategies. 
Further, if costs of testing and costs attached to losses of infor- 
mation concurrent with departures from reality can be determined, 
the selection of a measurement strategy can be based on a cost- 
effectiveness decision. 

The remainder of this paper is divided into three sections: a 
brief discussion on constraints and fidelity in performance testing;
a brief background on the rationale and mechanics of generalizability 
theory; and the methodology for the model, along with an example 
and discussion on decision-making. 

Preliminary Considerations 
Constraints in Performance Testing 

A method for determining losses due to departures from reality 
in performance testing has not been adequately explored. This is 
an extremely crucial issue since the basic premise of performance
testing lies in the measurement of presumably actual behaviors.
Therefore, the only perfectly valid performance test would be one
involving observation of the student's natural behavior. This would
prove impossible in all but a few situations. Lindquist (1951) has
identified several difficulties in direct measurement. First, the
nature of the objective being assessed often makes it impossible
to measure. Many objectives in the affective domain are
examples of this situation. Second, a natural series of events
may not be easily observable or may be inaccessible. Third,
observing some behaviors may be exceedingly difficult or impossible 
because of the relative infrequency of occasions when the behavior
is naturally elicited. Lindquist also points out the problem of
lack of comparability in accessible behavior samples for different
students. Perhaps one of the most constraining obstacles in making
direct measures lies in the difficulty of constructing such measures.
Even simple performance objectives may yield complex behaviors
which must be analyzed in order to develop appropriate and
valid performance tests. The additional effort required for designing
performance tests means their development is more costly
in time and energy, yet they still may be plagued by one or more
of these problems.

Lindquist outlines four basic types of tests: (a) giving the
learner the opportunity on a special occasion to perform the behavior
specified in the objective; (b) having the student exhibit behavior(s)
similar to the specified performance, making the assumption that a
relationship exists between the behaviors desired and elicited; (c)
giving the student a situation in which the desired behavior would
be necessary and asking what should be done and/or how he would
do it; and (d) testing the student on his or her knowledge of facts,
rules, principles, etc. which are necessary for successful demonstration
of the desired performance. These test types parallel succeeding
levels of reality. For this paper, these four test types will be
considered as: (a) actual; (b) simulated; (c) verbal; and (d)
subordinate knowledge or skills. The last three types of measures
represent departures from reality.

Although for many situations it is difficult to attempt to elicit 
the actual behavior, sometimes it is possible to do so. In Lindquist's
identical elements test, the elements of the actual performance must
be identical to the critical elements in the criterion behavior even
though they may be differently distributed in the natural or criterion
situations. An example of this would be the applicant for a clerk/
typist position who is asked to type a business letter as part of
the job application process. This letter may differ in degree of
difficulty from those typically typed on the job but the critical
aspect of typing a letter in a business format is identical. 

The second kind of test is the simulation. In a simulated test, 
the elements should be substantially related to the actual desired 
behavior. There should be considerable relationship between the 
elements in the simulation and in the actual test. This kind of 
test could be illustrated with the example of pilot simulator train- 
ing machines. 

The verbal description test type requires the student to respond 
to a situation based on how he would or ought to behave. The pre- 
sentation of the situation may be oral or written and the pattern 
for response may vary from free response to selection between alternative
answers. An example of this would be the vocational student
who is presented with a situation describing a stalled car and is 
asked how he would diagnose the problem. 

The fourth test type requires the student to exhibit his competence
in a particular subject by demonstrating mastery of pertinent
facts, rules, principles, etc. Although least desirable of the
four types of tests in terms of fidelity, this format has been the
most widely adopted type for measuring educational achievement.
However, the demonstration of prerequisite knowledge of a behavior
is not a sufficient condition for exhibiting a desired behavior, due
to the large disparity between the two conditions.

Fidelity 

Fidelity must be considered as the test designer moves from the 
real world to a simulated test environment. Fidelity is defined in 
terms of the degree of relationship between the real situation and
the test conditions. This relationship is not entirely dependent on
the face validity of the test conditions but instead depends on how
well the skills and knowledge exhibited in the simulated (and lower)
testing conditions transfer to the real world behavior (Branson,
Rayner, & Epstein, 1974). However, one expects high fidelity when
the test situation incorporates the highest level of reality possible
(i.e., actual, simulated) and low fidelity with lower levels of
measurement reality.

A valid performance test has been assumed to be one which has
complete fidelity and comprehensiveness (Fitzpatrick & Morrison, 1971).
But as tests more closely approximate the actual behavior they become
harder to control because of the difficulty of observing the
students under the same conditions. This difficulty of control
leads to less reliable measures. Thus, the more closely a test
approximates the actual performance, the more the difficulty of
controlling the situation can cause a loss of reliability.
This apparently paradoxical situation means that the dependability
of different performance test scores from tests purporting
to measure the same behavior may differ. Hence, the need
for empirical determination of the interaction of degree of task
fidelity and reliability of scores cannot be overemphasized if decisions
are to be made from performance test results.

Rationale and Mechanics of Generalizability Theory

Classical Concepts of Reliability

The rationale underlying the concept of generalizability theory 
can be more easily understood if one is familiar with the classical
notion of reliability. This definition of reliability of measurement
is that of consistency or stability of a set of test scores. The
important question in this definition lies in how the dimensions of
score stability are interpreted in the traditional estimates of
reliability. Before discussing the traditional reliability estimates,
an overview of the types of variability which can affect test results
should be outlined. Thorndike (1951) outlined the following: 

1) Lasting and general characteristics of the individual
(e.g., general level of intellect, ability to understand instructions)

2) Lasting but specific characteristics of the individual
(e.g., knowledge of the subject specific to a set of test items)

3) Temporary but general characteristics of the individual
(e.g., general state of health, fatigue, etc.)

4) Temporary but specific characteristics of the individual
(e.g., subject interaction with a certain item or set of items)

5) Systematic or chance factors affecting the administration of
the test or appraisal of test performance
(e.g., noisy conditions for taking a test, a grader being given
an incorrect answer key, etc.)

6) Chance or random variation
(e.g., lucky guessing)

Depending on what is being measured, the sources of variation that
should be accounted for will differ. Sources of variation included
in the scores, but not measured as "true" variation, introduce error
into the measurement process.

Two traditional estimates of reliability, coefficient alpha and
KR-20, are popular internal-consistency indices which tap the third
source in Thorndike's list. That is, they reflect the degree to
which a person's performance is consistent over a single set of
items. Note that the variability attributable to sources 1 or 2
cannot be assessed using alpha or KR-20. Also, sources 1, 2, 4, 5,
and 6 will be present in the set of scores, but alpha and KR-20
cannot detect them. Test-retest reliability considers the stability
of scores across administrations of similar or alternate forms of
a test. Thus, it is able to tap source 2. Sources 1, 2, 4, 5, and
6, however, will be present in the set of scores, but test-retest
reliability will not be able to detect them. Reliability estimates
derived from the Spearman-Brown prophecy formula can tap source 4.
Although all the other sources of score variability may be present,
the Spearman-Brown formula cannot detect them.

Thus, the particular estimate of reliability used can cause 
a difference in the estimated consistency of the scores. It is
also likely that the characteristics of the examinees which should
be measured are often not being measured as intended with these
reliability estimates.

Rationale for Generalizability Theory

Instead of yielding a reliability coefficient, generalizability
theory can yield a set of generalizability coefficients, and does so
for an important reason. One is forced to ask to what situation
one is generalizing. That is, how consistent is a set of scores
obtained under certain conditions? Here, the concepts of universe and
universe score are useful. Assuming that a population or domain of
admissible observations of examinee performance can be defined, then
this defined population constitutes the universe to which one could
generalize. For instance, consider the following universe: Selected
spelling words from the Kelly-James 10th grade spelling book, administered
orally by one teacher on a Thursday afternoon in April.
Assuming the measurements used were error-free, a true score could
be obtained for each examinee for this universe. This score would be
the examinee's universe score. Note that, for this example, if the
universe of admissible observations is changed to include performance
on two Thursdays in April, this would be a new universe, and each
person could well have a different universe score. Thus, generalizability
theory is concerned with the relationship of a set of
observed scores to the corresponding universe scores for the
examinees. The universe of conditions for performance assessment
is necessarily specified. This is the fundamental difference between
traditional reliability theory and the theory of generalizability.
In generalizability theory, the universe that is being generalized
to must be specified along with the admissible conditions of
observation for that universe. Hence, the sources of variation
considered true variation, and the sources considered error, are
also specified.

Generalizability theory can help provide answers to a number of 
questions which a person using or building a test may ask. Some of 
the more fundamental of these questions are: (a) What is the 
examinee's universe score?; (b) What amount of error is there in 
the estimation of an examinee's universe score?; (c) What are the 
sources and relative sizes of variability in examinees' scores?; 
and (d) What changes can be made in the measurement process in order 
to reduce the error in estimating an examinee's universe score? 
Each of these questions will be discussed in greater depth later in 
this paper. The model in this paper draws from the work in generalizability
theory by Cronbach et al. (1963; 1972).

Mechanics of Generalizability Theory 

Conditions which serve to describe the universe of admissible 
observations are termed facets. Facets are analogous to factors in
analysis-of-variance (ANOVA) designs. The basic determinations of 
generalizability analysis are achieved via an ANOVA approach. For 
example, consider a one-facet universe of different spelling words, 
with the population of words being all those in Webster's Third Edition. 
What this one-facet universe (i.e., of words, or items) means is
that one is only interested in making an estimate as to how well a
person can spell all the words in one dictionary, and this determination
is made by observing performance over a single sample of
words from the dictionary. Suppose 100 words were randomly selected
from the dictionary and administered to a group of 100 people. The 
resulting scores could be displayed in an array such as Table 1, below. 

Table 1

Array of Hypothetical Administration of Spelling Words

                          ITEMS
PERSONS     1    2    3    4   ...   99   100

    1       1    0    0    1   ...    0     1
    2       0    1    0    0   ...    0     1
    3       1    1    1    0   ...    1     1
   ...
   98       0    0    0    0   ...    1     0
   99       1    1    0    1   ...    1     0
  100       0    1    0    1   ...    0     1

Note: 1 indicates correct response, 0 indicates incorrect response.

These results can then be analyzed in a two-way ANOVA design, in
which persons and items are set up as factors with no person-item
replications. The ANOVA results would yield three distinct
sources of variation: (a) variation attributable to persons; (b)
variation attributable to items; and (c) residual variation. A
sample table of output from an ANOVA analysis for this example is
displayed in Table 2.

Table 2

Sample Output for Hypothetical Example

Source        SS        df      MS     E(MS)

Persons     193.05       99    1.95    $\sigma^2(\mathrm{res}) + 100\sigma^2(P)$
Items       148.50       99    1.50    $\sigma^2(\mathrm{res}) + 100\sigma^2(I)$
Residual    490.05    9,801     .05    $\sigma^2(\mathrm{res})$

Total       831.60    9,999
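The partition behind a source table of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `two_way_anova_no_replication` and the small 5 x 4 response matrix are hypothetical and are not the data behind Table 2.

```python
def two_way_anova_no_replication(scores):
    """Split a persons-by-items score matrix into person, item, and
    residual sums of squares (one observation per cell)."""
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i  # what is left after persons and items

    df_p, df_i = n_p - 1, n_i - 1
    df_res = df_p * df_i
    # Each entry is (sum of squares, degrees of freedom, mean square).
    return {
        "persons": (ss_p, df_p, ss_p / df_p),
        "items": (ss_i, df_i, ss_i / df_i),
        "residual": (ss_res, df_res, ss_res / df_res),
    }

# Hypothetical 0/1 responses: 5 persons by 4 items (illustration only).
demo = [
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
]
table = two_way_anova_no_replication(demo)
for source, (ss, df, ms) in table.items():
    print(f"{source:9s} SS={ss:.3f} df={df} MS={ms:.3f}")
```

With one observation per cell, the person-by-item interaction cannot be separated from random error, which is why both are pooled in the residual line.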







The coefficient of generalizability, or the relation between observed
and universe scores, is an intraclass correlation estimated by:

$$\hat{G} = \frac{\hat{\sigma}^2(P)}{\hat{\sigma}^2(P) + \hat{\sigma}^2(\mathrm{res})}$$

The reader will note that these estimated variance components are
derived using the E(MS)'s. In calculating $\hat{\sigma}^2(\mathrm{res})$,
we can use MS(res) as an unbiased, maximum-likelihood estimate.
Thus, $\hat{\sigma}^2(\mathrm{res}) = MS(\mathrm{res}) = 0.05$. The
variance component for items is calculated in a similar manner:

$$\hat{\sigma}^2(I) = \frac{MS(I) - MS(\mathrm{res})}{100} = \frac{1.50 - 0.05}{100} = .0145$$

Likewise, the variance component for persons is calculated as:

$$\hat{\sigma}^2(P) = \frac{MS(P) - MS(\mathrm{res})}{100} = \frac{1.95 - 0.05}{100} = .0190$$

The coefficient of generalizability is:

$$\hat{G} = \frac{.0190}{.0190 + 0.05} = \frac{.0190}{.0690} = .275$$

This figure gives
the estimated ratio of universe score variance to observed score
variance, assuming the items selected are a random sample from the
population of items, and the persons are representative of those
for whom the scores are to be used for decision-making. For the one-facet
case, using items as the facet, the coefficient of generalizability,
here .28, is the same figure that would be obtained if the
data were analyzed for determining coefficient alpha or KR-20.
Thus, alpha and KR-20 can be thought of as a special case of a
generalizability coefficient for a universe of one facet. However,
keep in mind the limitation that this coefficient refers only to
administrations of similar sets of items to similar persons under
exactly identical conditions. If the conditions of the test administration
are to differ in the future (for instance, if tests are to
be given before and after instruction, or at extremely different
times of the day, or after long intervals of time, or given by a
different teacher), any or all of these conditions might cause some
variation in scores which will not be accounted for in the one-facet
case. To remedy this, a multi-facet model is used. This is one
advantage generalizability theory enjoys over classical reliability
theory. To see the difference, we shall discuss a slightly more
complex model.
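As a check on the one-facet arithmetic, the expected-mean-square equations of Table 2 can be solved directly from the reported mean squares. The sketch below is illustrative only; the function name `one_facet_components` is hypothetical, and the inputs are the Table 2 mean squares with 100 persons and 100 items.

```python
def one_facet_components(ms_persons, ms_items, ms_residual, n_persons, n_items):
    """Solve the E(MS) equations of a crossed persons x items design
    under a random-effects model."""
    var_res = ms_residual                                 # E(MS res) = var(res)
    var_items = (ms_items - ms_residual) / n_persons      # E(MS I) = var(res) + n_p * var(I)
    var_persons = (ms_persons - ms_residual) / n_items    # E(MS P) = var(res) + n_i * var(P)
    g = var_persons / (var_persons + var_res)             # intraclass correlation
    return var_persons, var_items, var_res, g

# Mean squares as reported in Table 2.
var_p, var_i, var_res, g = one_facet_components(1.95, 1.50, 0.05, 100, 100)
print(f"persons={var_p:.4f} items={var_i:.4f} residual={var_res:.2f} G={g:.2f}")
```

The divisor for each component is the number of observations that enter the corresponding mean: each person mean is taken over the items, and each item mean over the persons.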

Consider a two-facet model of alternate forms and occasions.
That is, how stable are the spelling scores obtained if different
sets of randomly selected words (the alternate forms) are used, and
scores are taken across time (say, from week to week)? Once again,
an ANOVA approach would be used. For simplicity, assume that each
examinee takes each of three tests on each of three testing occasions
one week apart, and order of tests is randomized. The analysis
would proceed as it would for a fully crossed
factorial design. Total score variation in this example can be
partitioned into seven components: (a) person variation; (b) test
form variation; (c) occasion variation; (d) person X test interaction;
(e) person X occasion interaction; (f) test X occasion interaction; and
(g) residual variation. Thus, if three tests (10 items each) were
administered on three different occasions to 100 examinees, the
source table might look like the one presented in Table 3, below.



Table 3

Sample Output for Hypothetical Example

Source          SS         df     MS      E(MS)*

Persons       9875.55      99    99.75    $9\sigma^2(P) + 3\sigma^2(PT) + 3\sigma^2(PO) + \sigma^2(\mathrm{res})$
Tests          163.08       2    81.54    $300\sigma^2(T) + 3\sigma^2(PT) + 100\sigma^2(TO) + \sigma^2(\mathrm{res})$
Occasions       29.16       2    14.58    $300\sigma^2(O) + 3\sigma^2(PO) + 100\sigma^2(TO) + \sigma^2(\mathrm{res})$
P X T         1067.02     198     5.39    $3\sigma^2(PT) + \sigma^2(\mathrm{res})$
P X O          867.83     198     4.38    $3\sigma^2(PO) + \sigma^2(\mathrm{res})$
T X O           17.28       4     4.32    $100\sigma^2(TO) + \sigma^2(\mathrm{res})$
Residual      1537.5?     503     3.06    $\sigma^2(\mathrm{res})$

Total       13,567.61     899

*Using a random-effects model

The variance component estimates derived from this example are displayed
in Table 4, below.

Table 4

Variance Component Estimates for Example

Component                     Estimate of Variation    Proportion of total

$\sigma^2(\mathrm{res})$             3.06                    .205
$\sigma^2(TO)$                       0.013                   .0008
$\sigma^2(PO)$                       0.44                    .029
$\sigma^2(PT)$                       0.78                    .052
$\sigma^2(O)$                        0.03                    .002
$\sigma^2(T)$                        0.25                    .017
$\sigma^2(P)$                       10.34                    .690

The estimate of the generalizability coefficient is:

$$\hat{G} = \frac{\hat{\sigma}^2(P)}{\hat{\sigma}^2(P) + \hat{\sigma}^2(PT) + \hat{\sigma}^2(PO) + \hat{\sigma}^2(\mathrm{res})} = \frac{10.34}{10.34 + .78 + .44 + 3.06} = \frac{10.34}{14.62} = .71$$

Thus, there is fairly good stability of scores
across test forms and occasions. The relative sizes of the sources 
of variation (Table 4) show that tests, occasions, and their interactions
account, in sum, for just over 10% of the total variation.
The coefficient of generalizability would not be vastly different if
the data were reanalyzed collapsing over: (a) occasions, making
the one-facet model analogous to alternate forms reliability; or
(b) tests, making the one-facet model analogous to test-retest reliability.
Therefore, the two-facet model allows generalization to
several universes: that of two facets, that of the first facet only,
that of the second facet only, and variations on each, such as
nested designs, fixed and mixed models, and so on. The usage of
generalizability theory allows much more flexibility in the analysis
of performance assessments, thus more realistically reflecting the
real world. There are many other considerations and analyses in
generalizability theory, but for the purposes of this paper, we
may stop at this point.
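The two-facet estimates can be recovered from the mean squares of Table 3 by solving the random-effects E(MS) equations from the bottom of the table upward. The following sketch assumes the fully crossed persons x tests x occasions design described above; the function name `two_facet_components` and the dictionary keys are hypothetical labels chosen for illustration.

```python
def two_facet_components(ms, n_p, n_t, n_o):
    """Solve the random-effects E(MS) equations of a fully crossed
    persons x tests x occasions design for the variance components."""
    res = ms["res"]
    return {
        "res": res,
        "TO": (ms["TO"] - res) / n_p,
        "PO": (ms["PO"] - res) / n_t,
        "PT": (ms["PT"] - res) / n_o,
        "O": (ms["O"] - ms["PO"] - ms["TO"] + res) / (n_p * n_t),
        "T": (ms["T"] - ms["PT"] - ms["TO"] + res) / (n_p * n_o),
        "P": (ms["P"] - ms["PT"] - ms["PO"] + res) / (n_t * n_o),
    }

# Mean squares as reported in Table 3 (100 persons, 3 tests, 3 occasions).
mean_squares = {"P": 99.75, "T": 81.54, "O": 14.58,
                "PT": 5.39, "PO": 4.38, "TO": 4.32, "res": 3.06}
components = two_facet_components(mean_squares, n_p=100, n_t=3, n_o=3)

# Generalizing over both tests and occasions: every component that
# involves persons, other than the person component itself, counts as error.
g = components["P"] / (components["P"] + components["PT"]
                       + components["PO"] + components["res"])

total = sum(components.values())
for name, value in components.items():
    print(f"{name:3s} estimate={value:7.3f} proportion={value / total:.4f}")
print(f"G = {g:.2f}")
```

Changing which components are placed in the denominator is how the same set of estimates serves different universes of generalization.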

Description of the Model 

Methodology and an Example 

In evaluating alternative performance assessment strategies, a 
preliminary decision must be made concerning the face validity of 
each strategy. If a proposed alternative does not meet this first
requirement, then there is little value in attempting to use it. As
an example, consider the skill of using an electric adding machine.
While verbal aptitude test scores may correlate moderately with a
person's facility in using an electric adding machine, few people
would be satisfied with verbal aptitude tests as an alternative to
performance assessment using an adding machine. Of course, the
face validity determination should be made by those persons who
have to make decisions about the examinees and/or learning situation.

After one or more alternate measures of the desired performance
have been selected, carefully constructed, and tried out with a
few representative examinees, a study of the alternate methods can
be designed. If possible, the actual performance, or the simulation
nearest to it, should be included as one of the tasks. The purpose
of including the actual performance is to provide scores which
are as nearly error-free as possible, for determining the potential
for misclassification in the alternate methods (this is discussed in
more detail later), and for individual comparison of alternate
strategies. In designing the study, the most powerful design is a
fully crossed one, as in the examples discussed above. In the
fully crossed design, each examinee is administered all tasks under
all conditions deemed relevant enough to be included as a facet
in the design. "Most powerful" refers to the precision of the
variance component estimates, the G-coefficient estimate, and the
error estimates. However, nearly any nested or mixed design can be
utilized. If this is the case, however, some of the variance components
which could be estimated in the crossed design may not be
directly estimable.

After the study is designed and carried out, the results should 
be analyzed and interpreted. Finally, the potential for misclassification
should be considered under each alternative strategy. Costs
for the alternate methods should also be taken into account. One
method for using this information is presented by way of an example
given below.

Consider the following example. The behavior of interest is to
diagnose a fault requiring overhaul in a V-8 automotive engine and
to overhaul and repair the engine. Relevant conditions might include
the exercise being completed within 150% of the manufacturer's recommended
flat-rate time, and each learner executing the task alone. The
criterion for successful performance is all operational checks on
the finished engine meeting manufacturer's specifications. Now,
possible factors prohibiting such an exercise might be: lack of
ample automobiles equally in need of engine overhaul; lack of up
to forty hours "free time" for students to perform such an exercise;
and lack of supervisory personnel to monitor many students. Hence,
the case for some departure from reality is rather strong. Some
reasonable alternative strategies might be: (a) allow the students
to diagnose the fault in engines from five cars as well as describe
the required repairs, or instead perform five small tasks involved
in a complete engine overhaul on each of two engines; (b) verbally
describe to an examiner the proper sequence of steps to follow when
performing an engine overhaul; and (c) respond to a short-answer
paper-and-pencil test composed of items dealing with diagnosis and
overhaul of an automotive engine. These alternatives correspond to
levels b, c, and d of Lindquist's levels of measurement reality,
respectively.

Suppose each of the ten learners selected were examined in 
each of the methods outlined above as well as being given a car 
in need of an engine overhaul and told to diagnose and repair the 
problem. Suppose further that the actual measure was scored
0 or 1, depending upon whether the rebuilt engine met all the
manufacturer's operating specifications, alternatives a and b were
scored 0 to 10, and the paper-and-pencil test was twenty items
in length, each counting as one point. The order of these tasks
could be randomly determined for the examinees so order effects
would be minimized. Note that the assumption was made that all
the alternatives met minimum face validity requirements. This
describes a one-facet generalizability model. Suppose the score
matrix in Table 5, below, resulted from the study.

Table 5

Hypothetical Score Matrix for Ten Examinees on Four Different Tasks*

                                 Task
Examinee      1         2              3              4

    1         1       10 (1)         10 (1)         20 (1)
    2         0        5 (0)       [illegible]      15 (0)
    3         1        9 (1)       [illegible]      18 (1)
    4         0        2 (0)          4 (0)         12 (0)
    5         0        1 (0)          2 (0)       [illegible]
    6         1       10 (1)          9 (1)         1? (1)
    7         1       10 (1)         10 (1)         20 (1)
    8         1       10 (1)         10 (1)         20 (1)
    9         1        4 (0)          5 (0)       [illegible]
   10         0        8 (1)          8 (1)         1? (1)

*Numbers in parentheses are binary results given an arbitrary 80%
criterion for "success" on each task. Task 1 is the actual task, and
tasks 2, 3, and 4 are alternative strategies a, b, and c, above.
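The conversion from raw task scores to the binary results in parentheses can be sketched as follows. The function name `meets_criterion` is hypothetical, and the sketch assumes that "80% criterion" means a score at or above 80% of the maximum possible score, which is consistent with the legible entries of Table 5 (for example, 8 of 10 is scored 1 and 15 of 20 is scored 0).

```python
def meets_criterion(score, max_score, criterion_percent=80):
    """Return 1 if the score reaches the criterion percentage of the
    maximum possible score, otherwise 0 (integer arithmetic only)."""
    return 1 if score * 100 >= criterion_percent * max_score else 0

# Maximum scores for alternative tasks 2, 3, and 4.
max_scores = (10, 10, 20)

# Two fully legible rows of Table 5 (examinees 1 and 4), tasks 2-4.
raw_scores = {1: (10, 10, 20), 4: (2, 4, 12)}

binary = {
    examinee: tuple(meets_criterion(s, m) for s, m in zip(scores, max_scores))
    for examinee, scores in raw_scores.items()
}
print(binary)
```

Working in integers avoids any floating-point ambiguity for scores that fall exactly on the criterion, such as 8 of 10 or 16 of 20.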
Table 6, below, lists the results of a two-way ANOVA, using the
binary scores from Table 5.

Table 6

Results of ANOVA

Source       SS     df     MS      E(MS)*

Persons      6.6     9    0.73     $\sigma^2(\mathrm{res}) + 4\sigma^2(P)$
Tasks        0.3     3    0.10     $\sigma^2(\mathrm{res}) + 10\sigma^2(T)$
Residual     2.2    27    0.08     $\sigma^2(\mathrm{res})$

Total        9.1    39

*Using random-effects model

The resulting variance component estimates are listed below, in 
Table 7. 

Table 7

Variance Component Estimates for Example

Component     Estimate     % of Total

Persons        .1625          66
Tasks          .002           01
Residual       .08            33

The G-coefficient, the measure of the degree of consistency of
performance across tasks, is:

$$\hat{G} = \frac{\hat{\sigma}^2(P)}{\hat{\sigma}^2(P) + \hat{\sigma}^2(\mathrm{res})} = \frac{.1625}{.2425} = .67$$

Since between-task variation accounts for only 1% of the total
variation, if the cost of using the actual task is too great, the
less realistic tasks could be used with little loss of information.
The G-coefficient of .67 can be interpreted as the ratio of universe- 
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score variance to observed- score variance* Further analyses could 
be performed repeating the above, comparing individual tasks to the 
actual behavior. For instance, the 6-coefficients for comparing 
alternate tasks two at a time are: *59, *59, and .82, corresponding 
to comparing tasks 1 and 2, 1 and 3, and 1 and 4, respectively* 
Interestingly enough, for this example, the usage of the paper-and- 
pencil test with an 80SS criterion yields the least loss of information* 
The study could have been performed without setting criterion levels 
on the alternate tdsks, and the resulting coefficients would not 
be drastically changed* The reason for the binary scores is for 
discussion of misclassification, in the next section* Finally, any 
study like this example would strive to include as many examinees 
as possible* The greater the number of examinees and levels of 
facets, the more dopcTidable are the variance coniponent estimates. 
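
As a brief illustration, the G-coefficient for the four-task example can be computed directly from the variance component estimates in Table 7. The function name `g_coefficient` is illustrative. The formula is the single-observation ratio used in the text.

```python
def g_coefficient(var_persons, var_resid):
    """Ratio of universe-score variance to observed-score variance."""
    return var_persons / (var_persons + var_resid)


# Estimates from Table 7.
g = g_coefficient(var_persons=0.1625, var_resid=0.08)
print(f"G = {g:.2f}")
```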

Errors and Decision-making with Results 

The difference between an observed score X_ij for examinee i on task
j and his universe score μ_i (i.e., X_ij − μ_i) is the error, Δ.
Δ is analogous to the standard error of measurement in classical test
theory. The size of the error Δ reflects the amount of information
lost due to departures in task fidelity in the simulated tasks. A
means for determining whether this loss is reasonable or not can
be easily developed. First, the calculation of the error Δ should
be explained. Since Δ reflects (average) within-person variation,
it is calculated from the estimates of those variance components
considered to be within persons. For the example, the calculation
of Δ for the study is given in Table 8.

Table 8

Calculation of Error Δ

Variance component      Observations within      Contribution to
estimate                persons                  σ²(Δ)

σ²(T)   = .002                  4                    .0005
σ²(res) = .08                   4                    .02

                                             σ²(Δ) = .0205
                                             σ(Δ)  = .143



Using this computation procedure, the expected size of σ²(Δ) for
any future study can be calculated: the same variance component
estimates are used, together with the (expected) number of
within-person observations. For instance, for scores from five
randomly selected tasks, instead of four as in the example, the
expected size of σ(Δ) is .13, about a 10% reduction. The same
calculation procedure could be used with any number of different
facets, although variance component estimates would be needed for
the additional within-person facets.
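
The projection to a different number of tasks can be sketched as follows. This is a minimal illustration for the one-facet case only, where every within-person component is divided by the same number of observations. The function name `error_sd` is illustrative.

```python
import math


def error_sd(within_person_components, n_observations):
    """Standard deviation of the error for a one-facet design.

    Each within-person variance component contributes
    component / n_observations to the error variance.
    """
    error_variance = sum(c / n_observations for c in within_person_components)
    return math.sqrt(error_variance)


# Within-person components from Table 7: tasks and residual.
within = [0.002, 0.08]

sd_four = error_sd(within, 4)   # the example study
sd_five = error_sd(within, 5)   # a future study with five tasks
reduction = 1 - sd_five / sd_four

print(f"four tasks: {sd_four:.3f}")
print(f"five tasks: {sd_five:.3f}")
print(f"reduction:  {100 * reduction:.1f}%")
```

With a single facet, the reduction depends only on the ratio of the two task counts, since both components are divided by the same number.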

The non-symmetrical nature of confidence intervals is aptly
discussed by Cronbach et al. (1972); hence this discussion will only
include the conservative Chebychev approach. For a randomly selected
examinee in the example, a 75% confidence interval is given by
±.3, where σ(Δ) is used to obtain the .3: by the Chebychev
inequality, at least 75% of errors fall within two standard
deviations, and 2σ(Δ) is approximately .3.
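
The interval can be reproduced with the Chebychev inequality, which states that at least 1 − 1/k² of any distribution lies within k standard deviations of its mean. The sketch below solves for k at a given confidence level. The function name `chebychev_half_width` is illustrative.

```python
import math


def chebychev_half_width(sd, confidence):
    """Half-width k * sd such that at least `confidence` of the
    distribution lies within it, where 1 - 1/k**2 = confidence."""
    k = 1 / math.sqrt(1 - confidence)
    return k * sd


sd_delta = math.sqrt(0.0205)          # error variance from Table 8
half_width = chebychev_half_width(sd_delta, 0.75)
print(f"75% interval: +/- {half_width:.2f}")
```

For a 75% level, k equals 2, so the half-width is twice σ(Δ), which rounds to the .3 quoted above.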

Next, a threshold error level (TEL) has to be defined by the
decision-makers. This is the size of the error Δ such that any
values less than or equal to it are considered trivial, and any
values greater are considered significant. The determination of the
TEL can be approached by setting a given level of savings desired in
the total testing costs. That is, how much more economical does a
procedure alternate to the actual performance test have to be in
order to justify its use? (In a practical sense, this begs the
question of not using the actual performance itself, for whatever
the reason.) Suppose that in terms of personnel time alone, the
cost was $20 per examinee to use the actual behavior for the
performance test. For a group of ten examinees, the total cost is
$200. Now, consider the cost of misclassification in terms of
testing time alone. Both a false positive and a false negative
misclassification would require additional testing, but the false
negative misclassification would constitute the only added cost,
since the false positive misclassification would likely eventually
have to be retested anyway. Suppose the cost for the most expensive
alternate measurement strategy was $7.50 per man. Looking
at the original score matrix (Table 5), a maximum of two
misclassifications can be detected using any of the alternate tasks.
Adding the cost of retesting these two examinees ($15) to the
original cost for testing ten persons ($75) makes the
alternate task cost $90. Alternate task usage with the presently-
set TEL (a) results in more than a 50% savings. As a general rule,
therefore, once the desired amount of savings is determined (as long
as it does not exceed an error-free cost of testing using the least
expensive alternate measurement strategy), the TEL corresponding to
that amount of savings can be compared with the observed size of the
error Δ. If Δ ≤ TEL, an alternate procedure is usable, and, according
to this decision process, "reasonable." If Δ > TEL, then the number
of individual items required to reduce Δ to the TEL can be calculated,
as explained above. This result is the desired length of the alternate
task, or, in an analogous fashion, the number of alternate tasks which
need be administered.
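
The cost comparison and the decision rule can be sketched in a few lines. Two points are assumptions rather than statements from the paper: each retest is costed at the same $7.50 rate as the alternate measure, and the TEL value of .10 is purely hypothetical, since the text does not give a numerical TEL. The function names are illustrative.

```python
import math


def alternate_cost(n_examinees, cost_per_examinee, n_retests):
    """Total cost of the alternate strategy, with each retest costed
    at the same per-examinee rate (an assumption)."""
    return (n_examinees + n_retests) * cost_per_examinee


def tasks_needed(var_tasks, var_resid, tel):
    """Smallest number of tasks n for which
    sqrt((var_tasks + var_resid) / n) <= tel."""
    return math.ceil((var_tasks + var_resid) / tel ** 2)


actual = 10 * 20.00                        # actual-behavior test, ten examinees
alternate = alternate_cost(10, 7.50, 2)    # two misclassifications retested
savings = 1 - alternate / actual

print(f"actual cost:    ${actual:.2f}")
print(f"alternate cost: ${alternate:.2f}")
print(f"savings:        {100 * savings:.0f}%")

# Hypothetical TEL of .10 on the binary score scale.
n_required = tasks_needed(var_tasks=0.002, var_resid=0.08, tel=0.10)
print(f"tasks needed for TEL = .10: {n_required}")
```

Under these assumptions the alternate strategy costs $90 against $200, a 55% savings. A TEL of .10 would call for nine tasks, because eight tasks still leave σ(Δ) slightly above .10.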

Summary 

The advantage of the generalizability approach is clear: not
only can multiple levels of a facet be considered simultaneously
(something the product-moment correlation could not do), but it can
incorporate multiple facets, and it yields information on the relative
sizes and sources of score variation. Also, between-person variation
is not essential for useful results for decision-makers. The results
of such an analysis can be used to aid in a rational, empirically
based decision for determining an appropriate measurement strategy.
Cost-effectiveness decisions can also be made if the loss of
information (expressed as the size of the error of measurement) due to
different measurement approaches can be quantified on the same scale
as the cost of testing.

For the field of performance testing, the authors conclude that a 
generalizability theory approach to dealing with departures from 
reality in testing can aid in the establishment of empirically-based 
choices of measurement strategies.